ISSS608 AY2020-21_T2 Visual Analytic DataViz Assignment

How to detect unusual patterns in news articles

Jihun NAM
04-11-2021

1. Overview

1.1. Purpose of Visualisation

The public gathers information from news media and shares it via social platforms when the information is useful. Analysing news articles therefore helps us understand the public’s interests. However, making sense of the news is extremely difficult because (1) a huge volume of articles is published every day across multiple media sites, (2) extracting topics and information from each article is not easy, and (3) what the public considers important and what the media consider important can differ. This project aims to analyse local news headlines published in 2020 and provide readers with useful snapshots of news trends. The scope of this project is to visualise the trends of published articles over time by topic and by media site. In addition, unusual patterns in news articles will be detected using time series anomaly detection techniques.

Analysing engagement metrics (the number of Likes, Shares, and Comments on social media) for each article helps us understand the public’s interests. Newswhip also uses engagement figures to predict the likelihood that articles go viral. By visualising daily patterns of engagement figures by media site and by topic, we can recognise anomalous patterns and public interests more easily.

1.2. Sketch of Proposed Visualisation Design

1.2.1. One-way ANOVA test for the average daily interaction by topic and by media sites

[Sketch image: one-way ANOVA test design]

1.2.2. Time series anomaly detection for the average daily interaction by topic and by media sites

[Sketch image: time series anomaly detection design]

2. Pre-process Data Set

2.1. Prepare data set

The raw data set was extracted from Prrt, which provides Facebook engagement metrics. The data set contains news headlines, media sites, and publication dates from 1 January 2020 to 31 December 2020. These articles were published by local media sites, a mix of mainstream (Straits Times, Channel News Asia, AsiaOne) and non-mainstream (Mothership, MustShareNews, The Independent) outlets. From this raw data (224k articles), my team decided to filter out articles with fewer than 10 interactions. Doing this reduces the number of articles in the data set by 131,235 (close to 58%) and removes articles that were: (1) removed quickly after A/B testing, (2) removed quickly after editorial corrections, or (3) republished from international news wire agencies and therefore had little to no engagement. The remaining 93,166 articles will be used for further analysis. Furthermore, we will use the results of topic modelling to observe trends and detect anomalous patterns.
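The filtering step described above can be sketched with dplyr. This is purely illustrative: a tiny toy data frame stands in for the real 224k-row extract, and only the Total column matters here.

```r
library(dplyr)

# Toy stand-in for the raw extract (hypothetical rows, not real data)
article_raw_toy <- tibble(
  Headline = c("A", "B", "C", "D"),
  Total    = c(3, 10, 250, 7)
)

# Keep only articles with at least 10 interactions,
# mirroring the filtering rule described above
article_filtered <- article_raw_toy %>% filter(Total >= 10)
nrow(article_filtered)  # 2
```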

2.2. Install and Load R packages

The libraries below were used in this analysis:

- tidyverse: data manipulation and exploration
- plotly: interactive web graphics
- tibbletime/tibble, timetk: data manipulation for time series
- ggstatsplot: ggplot2-based plots annotated with statistical test details
- anomalize: anomaly detection

packages = c('tidyverse', 'readxl', 'dplyr', 'skimr', 'tibble', 
             'ggstatsplot','ggpubr','corrplot','seriation', 'dendextend',
             'tibbletime', 'anomalize', 'timetk','multcomp','devtools', 'car', 
             'DT', 'reshape2', 'plotly', 'ggforce','lubridate','tidyquant','ggplot2')

for(p in packages){
  if(!require(p, character.only = T)){
  install.packages(p)
  }
  library(p, character.only = T)
}

2.3. Import data file

Import the data set, which contains topic labels (93,166 articles).

Skim the imported data set. As seen below, it contains 93,166 articles, posted by 6 media sites (Source), with 20 unique topics (Topic) from “2020-01-01” to “2020-12-31”. We will use the “Total” field for analysing engagement metrics.
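The summary below comes from the skimr package; a minimal sketch on a toy data frame shows the same call (for the real data it would be `skim(article_raw)`):

```r
library(skimr)

# Toy data frame standing in for article_raw
toy <- data.frame(
  Source = c("A", "A", "B"),
  Total  = c(10, 223, 884)
)

# skim() prints per-type summaries: missing counts, ranges,
# quantiles, and a text histogram for numeric columns
skim(toy)
```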

Table 1: Data summary
Name article_raw
Number of rows 93166
Number of columns 11
_______________________
Column type frequency:
character 5
Date 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Headline 0 1 4 206 0 90945 0
Link 0 1 34 214 0 93165 0
Source 0 1 3 25 0 6 0
Cleaned 4 1 2 168 0 88810 0
Topic 0 1 5 21 0 20 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 2019-12-31 2020-12-31 2020-06-12 367

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Doc_ID 0 1 115604.79 62212.55 1 65974.25 114221.5 168194.8 224121 ▅▆▇▅▆
Total 0 1 1331.35 5700.90 10 38.00 223.0 884.0 685790 ▇▁▁▁▁
Likes 0 1 871.38 3799.74 0 20.00 146.0 572.0 384518 ▇▁▁▁▁
Shares 0 1 258.56 1200.00 0 11.00 37.0 146.0 209241 ▇▁▁▁▁
Comments 0 1 201.41 970.54 0 3.00 26.0 130.0 92031 ▇▁▁▁▁

2.4. Handling date field

article_raw$Date <- as.Date(article_raw$Date)
article_raw$month <- as.numeric(format(as.Date(article_raw$Date), "%m"))
article_raw$month <- as.factor(article_raw$month)

2.5. Aggregate data set by Date, Source, and Topic

Engagement figures should be aggregated by Date, Source, and Topic for time series analysis.

article_daily <- article_raw %>% 
  group_by(Source, Date, Topic) %>% 
  summarise(mean_interaction =  mean(Total, na.rm= TRUE))
article_daily$mean_interaction <- round(article_daily$mean_interaction, 0)
datatable(article_daily)

2.6. Create and look at an aggregated data set

We can create a categorical representation of the data, showing variable names and frequencies in table form, using the table() function. As the table below shows, each media/topic pair has a different number of values. Each value represents the number of days on which a media site published articles on a certain topic. For example, MSN published “Covid-Affected Places” articles on 258 days, but “Global Politics” articles on only 20 days. This imbalance affects how the ANOVA test is carried out.

  (1) The number of values (daily interactions by topic and by media site)
count_table <- table(article_daily$Topic, article_daily$Source) 
count_table
                       
                        Asiaone Channel News Asia Mothership MSN
  Covid-Affected Places     276               322        336 258
  Covid-Air Travel          254               362        201  75
  Covid-Daily Cases         236               358        336 231
  Covid-Domitory Cases      229               296        233 166
  Covid-Economy             284               363        181  60
  Covid-Education           275               335        258 139
  Covid-Food Delivery       313               306        334 244
  Covid-Outbreaks           279               354        156  75
  Crime                     294               330        288 140
  Entertainment News        345               325        280 134
  Financial Markets         213               355        115  33
  Global Politics           171               363        169  20
  Malaysia                  253               340        210 110
  Public Transport          290               335        264 155
  SG Accidents              286               257        327 242
  SG Elections              153               208        201 105
  SG Fake News              247               317        238  99
  SG Policy                 237               324        197  99
  Sports                    218               343        162  53
  World News                262               366        162  50
                       
                        The Independent Singapore The Straits Times
  Covid-Affected Places                        71               354
  Covid-Air Travel                            109               359
  Covid-Daily Cases                           101               359
  Covid-Domitory Cases                        111               343
  Covid-Economy                               152               366
  Covid-Education                             105               363
  Covid-Food Delivery                         112               338
  Covid-Outbreaks                             129               358
  Crime                                       135               345
  Entertainment News                          211               344
  Financial Markets                            92               362
  Global Politics                             181               361
  Malaysia                                    172               361
  Public Transport                            141               350
  SG Accidents                                153               336
  SG Elections                                294               283
  SG Fake News                                217               348
  SG Policy                                   143               361
  Sports                                      130               360
  World News                                   74               358
  (2) Average daily interactions by topic and by media site
article_daily_2 <- article_raw %>% 
  group_by(Source, Topic) %>% 
  summarise(mean_interaction =  mean(Total, na.rm= TRUE))

article_daily_2$mean_interaction <- round(article_daily_2$mean_interaction, 0)

mean_table <- dcast(article_daily_2, Topic~Source)
mean_table
                   Topic Asiaone Channel News Asia Mothership  MSN
1  Covid-Affected Places     295              2179       4016 2848
2       Covid-Air Travel     147              1145       3491 2094
3      Covid-Daily Cases     113              3507       1974 1391
4   Covid-Domitory Cases     265              3260       4000 2241
5          Covid-Economy     117               843       2173 1731
6        Covid-Education     171              1571       2507 3197
7    Covid-Food Delivery     353              1323       4385 2769
8        Covid-Outbreaks     453              1226       2549 3535
9                  Crime     269              2067       3159 1751
10    Entertainment News     316              1092       3698 3747
11     Financial Markets      99               602       2461 1333
12       Global Politics     111               641       1521 1506
13              Malaysia     212              1993       3684 2565
14      Public Transport     324              1486       3757 2099
15          SG Accidents     427              2555       4592 2436
16          SG Elections     254              2002       2079 1195
17          SG Fake News     179              1702       2219 1523
18             SG Policy     206              1263       2657 1658
19                Sports     147               666       2312 2097
20            World News     198              1049       4037 1893
   The Independent Singapore The Straits Times
1                        201              1072
2                        126               851
3                        171              3251
4                        415              2840
5                        376               377
6                        321              1103
7                        417              1075
8                        474               945
9                        125              1233
10                       176               775
11                       132               414
12                        91               328
13                       244              1417
14                       569              1443
15                       307              1376
16                       510               954
17                       372               501
18                       572               649
19                        75               394
20                       101               769

3. ANOVA Test

As the table above shows, the average daily interactions appear to differ by media site and by topic. For a more rigorous analysis, an ANOVA (Analysis of Variance) test is needed to determine whether there are statistically significant differences between the means. The anomaly detection analysis will depend on the results of this test.

3.1 Check for Normality

Before performing the ANOVA test, we need to check for normality. The results will determine which functions and parameters are used for the ANOVA test.

3.1.1. Density plot: Provides a visual judgement about whether the distribution is bell shaped or not.

ggdensity(article_daily$mean_interaction, 
          xlab = "Average daily interaction") + 
  facet_wrap(~ article_daily$Source)

p <- ggdensity(article_daily$mean_interaction, 
          xlab = "Average daily interaction") + 
  facet_wrap(~ article_daily$Topic)
p 

3.1.2. Q-Q Plot: Provide a visual judgement about the correlation between a given sample and the normal distribution.

ggqqplot(article_daily$mean_interaction) + 
  facet_wrap(~ article_daily$Source)

ggqqplot(article_daily$mean_interaction) + 
  facet_wrap(~ article_daily$Topic)

3.1.3. Result of normality test

From the aggregated data set and the visualisation graphs above, we can summarise that:

  1. Some groups have very few observations (for example, MSN published ‘Global Politics’ articles on only 20 days).
  2. None of the distributions follow a normal curve.
  3. Based on the Q-Q plots, the residuals are not normally distributed.

Therefore, a non-parametric ANOVA test will be conducted.
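The visual judgement above can be complemented with a quick numeric check. A sketch using the Shapiro-Wilk test on a toy right-skewed sample (engagement counts are heavily right-skewed, much like this):

```r
# Toy right-skewed sample standing in for mean_interaction
set.seed(123)
x <- rexp(500, rate = 1 / 1000)

# Shapiro-Wilk: a small p-value means we reject normality
shapiro.test(x)
```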

3.2. Visualisation of ANOVA test

The ggbetweenstats() function from the ‘ggstatsplot’ package plots the ANOVA test result. We will set type = “np” for a non-parametric test and pairwise.comparisons = TRUE to compare within the groups.

# for reproducibility
set.seed(123)

# plot
p1 <- ggstatsplot::ggbetweenstats(
      data = article_daily %>% filter(Topic == 'Covid-Air Travel'),
      x = Source,
      y = mean_interaction,
      mean.plotting = TRUE,
      mean.ci = TRUE,
      pairwise.comparisons = TRUE, # display results from pairwise comparisons
      notch = TRUE,
      type = "np",
      title = "Differences of daily interaction by Topic 'Covid-Air Travel'",
      messages = FALSE
    ) 
p1

In the ANOVA plot above, the null hypothesis of the (non-parametric) test is that the median values of the average daily interactions are the same across the media sites. Based on the test results, however, the differences in medians are statistically significant at the 95% confidence level (p-value < 0.05), which suggests that the values are not the same across the media sites.

In conclusion, people’s engagement on social media differs by topic and by media site. Therefore, we should split the data set by media site and topic when we perform anomaly detection.
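For reference, ggbetweenstats(type = "np") reports a Kruskal-Wallis test when there are more than two groups; the same test can be run directly with base R. A sketch on toy data (three hypothetical media groups with different typical engagement levels):

```r
set.seed(42)
# Toy data: three "media" groups with different typical engagement
toy <- data.frame(
  Source           = rep(c("A", "B", "C"), each = 50),
  mean_interaction = c(rexp(50, 1 / 100), rexp(50, 1 / 500), rexp(50, 1 / 2000))
)

# Kruskal-Wallis: non-parametric test that the groups share one distribution
kruskal.test(mean_interaction ~ Source, data = toy)
```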

4. Time Series Anomaly Detection

When an article attracts more interest, it receives more engagement than others, i.e. higher engagement figures. We can see the daily trend of interactions by topic in the charts below (“Daily trend of the average interactions by each topic”). However, it is extremely difficult to tell when and which articles unusually intrigued the public just by looking at the charts. For this reason, we will use a time series anomaly detection technique to find unusual patterns in the articles.

4.1. Visualisation - Daily trend of the average interactions by topic and by media

We can see the detailed information when hovering over the line graph. The peak points could be unusual dates; however, a more statistical approach is needed to confirm which ones are anomalous.

p_a <- ggplot(data = article_daily %>% filter(Topic=='Covid-Air Travel'), 
             aes(x = Date, y = mean_interaction, color = Source)) +
  geom_line()
p_a<- p_a + ggtitle("Daily trend of the average interaction: Topic 'Covid-Air Travel'")
fig_a <- ggplotly(p_a)
fig_a

4.2. Using the ‘timetk’ package

The R ‘timetk’ package helps to plot, manipulate, and forecast time series data in R.

4.2.1. Pre-process data for anomaly detection

For analysis purposes, we will use the filtered data set (Topic == ‘Covid-Affected Places’ & Source == ‘The Straits Times’), then select only the “Date” and “mean_interaction” columns to fit the tibble format. If you want to examine another topic or media site, you can change the values of “topic_filter” or “Source_filter”.

topic_filter = 'Covid-Affected Places'
Source_filter = 'The Straits Times'

df_tmp <- article_daily %>%
  filter(Topic == topic_filter & Source == Source_filter)%>% 
  select(Date,mean_interaction)

df_tibble <- as_tibble(df_tmp)
class(df_tibble)
[1] "tbl_df"     "tbl"        "data.frame"

4.2.2. Visualisation of time series anomaly detection

plot_anomaly_diagnostics() is a visualisation wrapper for tk_anomaly_diagnostics(), the group-wise anomaly detection function, which implements a two-step process to detect outliers in time series.

The decomposition separates the “season” and “trend” components from the “observed” values leaving the “remainder” for anomaly detection. Once “trend” and “season” (seasonality) is removed, anomaly detection is performed on the “remainder”. Anomalies are identified, and boundaries (recomposed_l1 and recomposed_l2) are determined.
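The season/trend/remainder split can be illustrated with base R’s stl() on a toy daily series with weekly seasonality. This is only a sketch of the idea, not timetk’s exact implementation:

```r
set.seed(1)
# Toy daily series: weekly seasonality + trendless noise
daily <- ts(100 + 10 * sin(2 * pi * (1:365) / 7) + rnorm(365, sd = 5),
            frequency = 7)

# STL decomposition: observed = seasonal + trend + remainder
parts <- stl(daily, s.window = "periodic")
colnames(parts$time.series)  # "seasonal" "trend" "remainder"
```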

The anomaly detection method uses an inner quartile range (IQR) of ±25% around the median.
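A minimal sketch of an IQR-style rule on a remainder series: values outside median ± k·IQR are flagged (k = 3 here purely for illustration; the width of timetk’s “normal” band is controlled by its .alpha argument):

```r
set.seed(7)
# Toy remainder series with one obvious outlier at the end
remainder <- c(rnorm(100, sd = 50), 5000)

med <- median(remainder)
k   <- 3
anomaly <- remainder < med - k * IQR(remainder) |
           remainder > med + k * IQR(remainder)
which(anomaly)  # index of the flagged point(s)
```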

df_tibble %>% timetk::plot_anomaly_diagnostics(Date,mean_interaction, .facet_ncol = 3)

4.2.3. Find anomalous date and detailed information

To extract the anomalous dates together with the average daily interaction figures, we can use the tk_anomaly_diagnostics() function. In the table, the “observed” value is the average daily interaction figure, and the “trend” value represents the expected baseline value around that date.

anomal_list <- df_tibble %>% timetk::tk_anomaly_diagnostics(Date, mean_interaction) %>% filter(anomaly=='Yes')
anomal_list
# A tibble: 22 x 11
   Date       observed season trend remainder seasadj remainder_l1
   <date>        <dbl>  <dbl> <dbl>     <dbl>   <dbl>        <dbl>
 1 2020-03-17    6344. -128.   403.     6070.   6472.       -2624.
 2 2020-03-18    5019.  132.   403.     4484.   4887.       -2624.
 3 2020-03-31   10592  -128.   410.    10311.  10720.       -2624.
 4 2020-04-06   10335.  -79.8  416.     9998.  10415.       -2624.
 5 2020-04-09   13655.  -18.1  420.    13253.  13673.       -2624.
 6 2020-04-20    5285   -79.8  430.     4934.   5365.       -2624.
 7 2020-04-24    5031    37.5  433.     4561.   4994.       -2624.
 8 2020-04-25    4884  -111.   432.     4562.   4995.       -2624.
 9 2020-04-28    3831  -128.   431.     3528.   3959.       -2624.
10 2020-05-06    6040.  132.   430.     5478.   5908.       -2624.
# ... with 12 more rows, and 4 more variables: remainder_l2 <dbl>,
#   anomaly <chr>, recomposed_l1 <dbl>, recomposed_l2 <dbl>

From the anomal_list data set, we can also retrieve detailed information, such as the articles’ headlines and engagement figures, from the raw data set. With this detail, we can see which articles had the greatest impact on the public on a given day.

# Select anomalous date
date_filter <- anomal_list$Date[1]

# Find trend value (normal average of daily interaction)
anomal_list2 <- anomal_list %>% filter(Date == date_filter)
trend <- round(anomal_list2$trend,0)

# Create anomalous article list with Topic, Source, Date, Headline, Total(Engagement), and Trend columns
anomal_article <- article_raw %>%
  filter(Topic == topic_filter, Source == Source_filter, Date == date_filter) %>% 
  select(Topic, Source, Date, Headline, Total)
anomal_article$Trend <- trend

datatable(anomal_article)

5. Reflection on Advantages of incorporating Interactivity

As seen above, the analysis involves many subsets of the data. The raw data set contains six unique media sites and 20 different topics from 2020-01-01 to 2020-12-31. When we want to locate a specific slice of the data and its analysis result, static visualisation has many limitations; for example, it would be extremely tedious to change the filter values every time and compare the results. With interactive visualisation, selecting the filter values becomes much easier and more flexible. Interactive visualisation is also useful for showing analysis results in a single page view: presenting everything analysed above in static form would make the report extremely long, so interactivity helps readers see the results at a glance.

6. References